K-means

It is a clustering algorithm based on unsupervised learning. The main goal of this algorithm is to find groups in the data. Each group is called a cluster and has a centroid. The letter K is used to represent the number of groups, and therefore, the number of centroides. It is an iterative algorithm to group data in clusters with similar properties. Below you can see the three steps of the K-means algorithm. You can search videos on the Internet to better understand this algorithm.

The algorithm begins by randomly choose K samples (rows) from the input data set. These K samples are set as the K centroids of the cluster (group). You can the K-means++ algorithm described later.
Each sample (row) in the input data set is assigned to a cluster (or group). In fact, each input sample is assigned to cluster with the closest centroid. Thus, each centroid will have a set of input rows assigned.
Centroids are recalculated using the mean value of all the sample in each cluster. We repeat steps 2 and 3 until the centroids stop moving at each iteration.

Es un algoritmo de agrupamiento basado en aprendizaje sin supervisión. La meta principal de este algoritmo es encontrar grupos en los datos. Cada grupo es llamado racimo (cluster) y tiene un centroide. La letra K se usa para representar el número de grupos y por lo tanto el número de centroides. Este es un algoritmo iterativo para agrupar de datos con propiedades similares. Debajo se muestran los tres pasos del algoritmo de K-means. Usted puede buscar videos en la Internet para entender mejor este algoritmo.

El algoritmo comienza seleccionando K muestras (renglones) del conjunto de datos de entrada. Estas K muestras son usadas como los K centro de los grupos. Usted puede usar el algoritmo K-means++ descrito después.
A cada muestra (renglón) en los datos de entrada es asignada a un grupo. De hecho, a cada muestra de entrada se le asigna el grupo que tenga el centroide mas cercano. Así, cada centroide tendrá un grupo de renglones de entrada asignados.
Los centroides son recalculados usando el valor promedio de todas las muestras en el grupo. Se repiten los pasos 2 y 3 hasta que los centroides ya no se mueven en cada iteración.

K-means++

It is an initialization method that can improve the performance of the algorithm.

Randomly choose the first centroid from the data
For each input data x, compute the distance D(x) between this input data and the nearest centroided that has been chosen
Randomly choose a new input data as a new centroid using a probability proportional to the squared of the distance, D(x)²
Repeat steps two and three until all the centers have been chosen

Este es un método de inicialización que puede mejorar el desempeño del algoritmo.

Escoger en forma aleatoria el primer centroide de entre los datos
Para cada dato de entrada, calcular la distancia entre este dato de entrada y el centroide más cercano que ya se ha escogido
Escoger aleatoria un nuevo punto de entrada como un nuevo centroide usando una probabilidad proporcional al cuadrado de la distancia, D(x)²
Repetir los pasos dos y tres hasta que todos los centroides se han escogido

Problem 1

Create a Dialog application using Wintempla called Wine to classify a synthetic database using the K-means algorithm.

Cree una aplicación de Diálogo a usando Wintempla llamada Wine para clasificar una base de datos sintética usando el algoritmo de K-means.

Wine.h

#pragma once //______________________________________ Wine.h
#include "Resource.h"
#define CLUSTER_SIZE 1000

class Wine: public Win::Dialog
{
public:
     Wine()
     {
          randGen.seed(::GetTickCount());
     }
     ~Wine()
     {
     }
     std::mt19937 randGen;
     void CreateCluster(double centerX, double centerY, double centerZ, double radius, size_t pointCount, MATRIX& out_data);
     . . .
};

Wine.cpp

. . .
void Wine::Window_Open(Win::Event& e)
{
     //________________________________________________________ lvCluster
     lvCluster.Cols.Add(0, LVCFMT_LEFT, 100, L"Index");
     lvCluster.Cols.Add(1, LVCFMT_RIGHT, 100, L"Cluster");
     //________________________________________________________ lvCentroid
     lvCentroid.Cols.Add(0, LVCFMT_LEFT, 80, L"No");
     lvCentroid.Cols.Add(1, LVCFMT_RIGHT, 80, L"X");
     lvCentroid.Cols.Add(2, LVCFMT_RIGHT, 80, L"Y");
     lvCentroid.Cols.Add(3, LVCFMT_RIGHT, 80, L"Z");
     //________________________________________________________
     size_t i;
     MATRIX dataset;
     MATRIX cluster;
     //________________________________________________________ 1. Cluster at (1, 1, 1)
     CreateCluster(1.0, 1.0, 1.0, 0.95, CLUSTER_SIZE, cluster);
     for (i = 0; i < CLUSTER_SIZE; i++) dataset.push_back(cluster[i]);
     //________________________________________________________ 2. Cluster at (-1, 1, 1)
     CreateCluster(-1.0, 1.0, 1.0, 0.85, CLUSTER_SIZE, cluster);
     for (i = 0; i < CLUSTER_SIZE; i++) dataset.push_back(cluster[i]);
     //________________________________________________________ 3. Cluster at (1, -1, 1)
     CreateCluster(1.0, -1.0, 1.0, 0.97, CLUSTER_SIZE, cluster);
     for (i = 0; i < CLUSTER_SIZE; i++) dataset.push_back(cluster[i]);
     //________________________________________________________ 4. Cluster at (1, 1, -1)
     CreateCluster(1.0, 1.0, -1.0, 0.94, CLUSTER_SIZE, cluster);
     for (i = 0; i < CLUSTER_SIZE; i++) dataset.push_back(cluster[i]);
     //________________________________________________________ 5. Cluster at (-1, -1, 1)
     CreateCluster(-1.0, -1.0, 1.0, 0.96, CLUSTER_SIZE, cluster);
     for (i = 0; i < CLUSTER_SIZE; i++) dataset.push_back(cluster[i]);
     //________________________________________________________ 6. Cluster at (1, -1, -1)
     CreateCluster(1.0, -1.0, -1.0, 0.86, CLUSTER_SIZE, cluster);
     for (i = 0; i < CLUSTER_SIZE; i++) dataset.push_back(cluster[i]);
     //________________________________________________________ 7. Cluster at (-1, 1, -1)
     CreateCluster(-1.0, 1.0, -1.0, 0.93, CLUSTER_SIZE, cluster);
     for (i = 0; i < CLUSTER_SIZE; i++) dataset.push_back(cluster[i]);
     //________________________________________________________ 8. Cluster at (-1, -1, -1)
     CreateCluster(-1.0, -1.0, -1.0, 0.92, CLUSTER_SIZE, cluster);
     for (i = 0; i < CLUSTER_SIZE; i++) dataset.push_back(cluster[i]);
     //________________________________________________________ 9. Execute K-means algoritm
     MATRIX centroid;
     vector<int> result;
     Math::KMeans kmeans;
     kmeans.CreateClusters(100, dataset, 8, centroid, result);
     //________________________________________________________ 10. Display cluster
     wchar_t text[64];
     const size_t rowCount = result.size();
     for (i = 0; i < rowCount; i++)
     {
          _snwprintf_s(text, 64, _TRUNCATE, L"%d", i+1);
          lvCluster.AddItem(text, 0);
          //
          _snwprintf_s(text, 64, _TRUNCATE, L"%d", (int)result[i]);
          lvCluster.SetItemText(i, 1, text);
     }
     //________________________________________________________ 11. Display centroids
     for (i = 0; i < 8; i++)
     {
          _snwprintf_s(text, 64, _TRUNCATE, L"%d", i+1);
          lvCentroid.AddItem(text, 0);
          //_______________________ X
          _snwprintf_s(text, 64, _TRUNCATE, L"%.2f", centroid[i][0]);
          lvCentroid.SetItemText(i, 1, text);
          //_______________________ Y
          _snwprintf_s(text, 64, _TRUNCATE, L"%.2f", centroid[i][1]);
          lvCentroid.SetItemText(i, 2, text);
          //_______________________ Z
          _snwprintf_s(text, 64, _TRUNCATE, L"%.2f", centroid[i][2]);
          lvCentroid.SetItemText(i, 3, text);
     }
}

void Wine::CreateCluster(double centerX, double centerY, double centerZ, double radius, size_t pointCount, MATRIX& out_data)
{
     std::uniform_real<double> distribution(-radius, radius);
     if (out_data.size() != pointCount) out_data.resize(pointCount);
     const double sqrad= radius*radius;
     double distance;
     for (size_t i = 0; i < pointCount; i++)
     {
          out_data[i].resize(3);
          while (true)
          {
               out_data[i][0] = distribution(randGen);
               out_data[i][1] = distribution(randGen);
               out_data[i][2] = distribution(randGen);
               distance = out_data[i][0]*out_data[i][0] + out_data[i][1]*out_data[i][1] + out_data[i][2]*out_data[i][2];
               if (distance < sqrad) break;
          }
          out_data[i][0] += centerX;
          out_data[i][1] += centerY;
          out_data[i][2] += centerZ;
     }
}

K-means

K-means

K-means++

© Copyright 2000-2021 Wintempla selo. All Rights Reserved. Jul 22 2021. Home